Abstract

We study the problem of estimating multiple predictive functions from a dictionary of basis functions 
in the nonparametric regression setting. Our estimation scheme assumes that each predictive function 
can be estimated in the form of a linear combination of the basis functions. By assuming that the 
coefficient matrix admits a sparse low-rank structure, we formulate the function estimation problem as 
a convex program regularized by the trace norm and the $\ell_1$-norm simultaneously. We propose to solve
the convex program using the accelerated gradient (AG) method and the alternating direction method 
of multipliers (ADMM) respectively; we also develop efficient algorithms to solve the key components in 
both AG and ADMM. In addition, we conduct theoretical analysis on the proposed function estimation 
scheme: we derive a key property of the optimal solution to the convex program; based on an assumption 
on the basis functions, we establish a performance bound of the proposed function estimation scheme 
(via the composite regularization). Simulation studies demonstrate the effectiveness and efficiency of the 
proposed algorithms. 



1 Introduction 

We study the problem of estimating multiple predictive functions from noisy observations. Such a problem 
has received broad attention in many areas of statistics and machine learning [BJ [HJ HH HH]. This line of
work can be roughly divided into two categories: parametric estimation and non-parametric estimation; a 
common and important theme for both categories is the appropriate assumption of the structure in the 
model parameters (parametric setting) or the coefficients of the dictionary (nonparametric setting). 

There is an extensive literature on effective function estimation based on different
sparsity constraints, including the estimation of sparse linear regression via the $\ell_1$-norm penalty [3, 6, 27, 32]
and the estimation of linear regression functions using the group lasso estimator [15, 16]. More recently,
trace norm regularization has become a popular tool for approximating a set of linear models and the
associated low-rank matrices in the high-dimensional setting [15, 32]; the trace norm is the tightest convex
surrogate [12] for the (non-convex) rank function under certain conditions, encouraging sparsity in the
singular values of the matrix of interest. One limitation of trace norm regularization is that the
resulting model is dense in general. However, in many real-world applications [21], the underlying structure of
multiple predictive functions may be sparse as well as low-rank; the sparsity leads to explicitly interpretable
prediction models and the low rank implies essential subspace structure information. Similarly, the $\ell_1$-norm
is the tightest convex surrogate for the non-convex cardinality function [5], encouraging sparsity in the
entries of the matrix. This motivates us to explore the combination of the trace norm and the
$\ell_1$-norm as a composite regularization (called sparse trace norm regularization) to induce the desired sparse
low-rank structure.

*This work was done when the first author was a Ph.D. student at Arizona State University.

Trace norm regularization (minimization) has been investigated extensively in recent years. Efficient
algorithms have been developed for solving convex programs with trace norm regularization [29, 12]; sufficient
conditions for exact recovery from trace norm minimization have been established in [22]; consistency of
trace norm minimization has been studied in [1]; trace norm minimization has been applied to matrix
completion [7] and collaborative filtering [25, 23]. Similarly, $\ell_1$-norm regularization has been well studied
in the literature; to mention a few directions, there is work on efficient algorithms for convex optimization [11, 13, 29],
theoretical guarantees of the performance [9, 32], and model selection consistency [33].

In this paper, we focus on estimating multiple predictive functions simultaneously from a finite dictionary 
of basis functions in the nonparametric regression setting. Our function estimation scheme assumes that each 
predictive function can be approximated using a linear combination of those basis functions. By assuming 
that the coefficient matrix of the basis functions admits a sparse low-rank structure, we formulate the 
function estimation problem as a convex formulation, in which the combination of the trace norm and the 
$\ell_1$-norm is employed as a composite regularization to induce a sparse low-rank structure in the coefficient
matrix. The simultaneous sparse and low-rank structure is different from the incoherent sparse and low-rank
structures studied in [5, 10]. We propose to solve the function estimation problem using the accelerated
gradient method and the alternating direction method of multipliers; we also develop efficient algorithms
to solve the key components involved in both methods. We conduct theoretical analysis on the proposed
convex formulation: we first present some basic properties of the optimal solution to the convex formulation
(Lemma 4.1); we then present an assumption associated with the geometric nature of the basis functions
over the prescribed observations; based on such an assumption, we derive a performance bound for the
combined regularization for function estimation (Theorem 4.1). We conduct simulations on benchmark data
to demonstrate the effectiveness and efficiency of the proposed algorithms. 

Notation Denote $N_n = \{1, \dots, n\}$. For any matrix $\Theta$, denote its trace norm by $\|\Theta\|_*$, i.e., the sum of
the singular values; denote its operator norm by $\|\Theta\|_2$, i.e., the largest singular value; denote its $\ell_1$-norm by
$\|\Theta\|_1$, i.e., the sum of the absolute values of all entries.



2 Problem Formulation 

Let $\{(x_1, y_1), \dots, (x_n, y_n)\} \subset \mathbb{R}^d \times \mathbb{R}^k$ be a set of prescribed sample pairs (fixed design) associated with $k$
unknown functions $\{f_1, \dots, f_k\}$ as

$$y_{ij} = f_j(x_i) + w_{ij}, \quad i \in N_n, \; j \in N_k, \tag{1}$$

where $f_j : \mathbb{R}^d \to \mathbb{R}$ is an unknown regression function, $y_{ij}$ denotes the $j$-th entry of the response vector $y_i \in \mathbb{R}^k$,
and $w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$ is a stochastic noise variable. Let $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$, $Y = [y_1, \dots, y_n]^T \in \mathbb{R}^{n \times k}$,
and $W = (w_{ij})_{ij} \in \mathbb{R}^{n \times k}$. Denoting

$$\mathcal{F} = (f_j(x_i))_{ij} \in \mathbb{R}^{n \times k}, \quad i \in N_n, \; j \in N_k, \tag{2}$$

we can rewrite Eq. (1) in a compact form as $Y = \mathcal{F} + W$. Let $\{g_1, \dots, g_h\}$ be a set of $h$ pre-specified basis
functions with $g_i : \mathbb{R}^d \to \mathbb{R}$, and let $\Theta = [\theta_1, \dots, \theta_k] \in \mathbb{R}^{h \times k}$ be the coefficient matrix. We define

$$\hat{g}_j(x) = \sum_{i=1}^{h} \theta_{ij}\, g_i(x), \quad j \in N_k, \tag{3}$$

where $\theta_{ij}$ denotes the $i$-th entry in the vector $\theta_j$. Note that in practice the basis functions $\{g_i\}$ can be
estimators from different methods, or from different values of the tuning parameters of the same method.

We consider the problem of estimating the unknown functions $\{f_1, \dots, f_k\}$ using the composite functions
$\{\hat{g}_1, \dots, \hat{g}_k\}$ defined in Eq. (3), respectively. Denote

$$\mathcal{G}_X = (g_j(x_i))_{ij} \in \mathbb{R}^{n \times h}, \quad i \in N_n, \; j \in N_h, \tag{4}$$
and define the empirical error as

$$\mathcal{S}(\Theta) = \frac{1}{N} \sum_{i=1}^{n} \sum_{j=1}^{k} \left( \hat{g}_j(x_i) - y_{ij} \right)^2 = \frac{1}{N} \|\mathcal{G}_X \Theta - Y\|_F^2, \tag{5}$$

where $N = n \times k$. Our goal is to estimate the model parameter $\Theta$ of a sparse low-rank structure from the
given $n$ sample pairs $\{(x_i, y_i)\}_{i=1}^{n}$. Such a structure induces sparsity and low rank simultaneously in
a single matrix of interest.

Given that the functions $\{f_1, \dots, f_k\}$ are coupled via $\Theta$ in some coherent sparse and low-rank structure,
we propose to estimate $\Theta$ as

$$\hat{\Theta} = \arg\min_{\Theta} \left( \mathcal{S}(\Theta) + \alpha \|\Theta\|_* + \beta \|\Theta\|_1 \right), \tag{6}$$

where $\alpha$ and $\beta$ are regularization parameters (estimated via cross-validation), and the linear combination of
$\|\Theta\|_*$ and $\|\Theta\|_1$ is used to induce the sparse low-rank structure in $\Theta$. The optimization problem in Eq. (6) is
non-smooth convex and hence admits a globally optimal solution; it can be solved using many sophisticated
optimization techniques [28, 12]; in Section 3 we propose to apply the accelerated gradient method [19] and
the alternating direction method of multipliers [4] to solve the optimization problem in Eq. (6).
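
As a concrete reading of Eq. (6), the following minimal NumPy sketch evaluates the composite objective for the least squares loss of Eq. (5). The function name and argument names are illustrative assumptions and do not come from the paper; `G` stands for the $n \times h$ matrix of basis-function values.

```python
import numpy as np

def sparse_trace_norm_objective(G, Y, Theta, alpha, beta):
    """Evaluate S(Theta) + alpha*||Theta||_* + beta*||Theta||_1 (least squares loss).

    G is the n-by-h matrix of basis-function values, Y the n-by-k responses,
    Theta the h-by-k coefficient matrix. All names are hypothetical.
    """
    n, k = Y.shape
    N = n * k                                      # N = n x k as in Eq. (5)
    residual = G @ Theta - Y
    loss = np.sum(residual ** 2) / N               # (1/N) * squared Frobenius norm
    trace_norm = np.linalg.norm(Theta, ord="nuc")  # sum of singular values
    l1_norm = np.abs(Theta).sum()                  # sum of absolute entries
    return loss + alpha * trace_norm + beta * l1_norm
```

Both penalties are computed on the same matrix `Theta`, which is what distinguishes this formulation from decompositions that penalize a separate sparse part and a separate low-rank part.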



3 Optimization Algorithms 

In this section, we apply the accelerated gradient (AG) algorithm [2, 19] and the alternating
direction method of multipliers (ADMM) [4], respectively, to solve the (non-smooth and convex) optimization
problem in Eq. (6). We also develop efficient algorithms to solve the key components involved in both AG
and ADMM.



3.1 Accelerated Gradient Algorithm 

The AG algorithm has attracted extensive attention in the machine learning community due to its optimal
convergence rate among all first-order techniques and its ability to deal with large-scale data. The general
scheme in AG for solving Eq. (6) can be described as follows: at the $k$-th iteration, the intermediate (feasible)
solution $\Theta_k$ can be obtained via

$$\Theta_k = \arg\min_{\Theta} \; \frac{\gamma_k}{2} \left\| \Theta - \left( \Phi_k - \frac{1}{\gamma_k} \nabla \mathcal{S}(\Phi_k) \right) \right\|_F^2 + \alpha \|\Theta\|_* + \beta \|\Theta\|_1, \tag{7}$$

where $\Phi_k$ denotes a searching point constructed on the intermediate solutions from previous iterations,
$\nabla \mathcal{S}(\Phi_k)$ denotes the derivative of the loss function in Eq. (5) at $\Phi_k$, and $\gamma_k$ specifies the step size, which can
be determined by iterative increment until the condition

$$\mathcal{S}(\Theta_k) \le \mathcal{S}(\Phi_k) + \langle \nabla \mathcal{S}(\Phi_k), \Theta_k - \Phi_k \rangle + \frac{\gamma_k}{2} \|\Theta_k - \Phi_k\|_F^2$$

is satisfied. The operation in Eq. (7) is commonly referred to as the proximal operator [17], and its efficient
computation is critical for the practical convergence of AG-type algorithms. Next we present an efficient
alternating optimization procedure to solve Eq. (7) with a given $\gamma_k$.
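
The scheme above can be sketched as a generic accelerated proximal gradient loop. This is a minimal illustration and not the authors' implementation: the FISTA-style momentum sequence `t`, the growth factor `eta`, and the callable names `loss`, `grad`, and `prox` are all assumptions of this sketch. The caller supplies `prox(V, gamma)`, which must return the minimizer of $\frac{\gamma}{2}\|\Theta - V\|_F^2$ plus the regularizer, i.e., the operation in Eq. (7).

```python
import numpy as np

def accelerated_gradient(loss, grad, prox, theta0, gamma0=1.0, eta=2.0, n_iter=100):
    """FISTA-style accelerated proximal gradient loop (illustrative sketch).

    loss(T), grad(T): smooth loss and its gradient.
    prox(V, gamma): argmin_T gamma/2 * ||T - V||_F^2 + regularizer(T).
    """
    theta = theta0.copy()
    phi = theta0.copy()          # searching point Phi_k
    t = 1.0                      # momentum sequence (assumed, FISTA-style)
    gamma = gamma0
    for _ in range(n_iter):
        g = grad(phi)
        loss_phi = loss(phi)
        while True:
            theta_new = prox(phi - g / gamma, gamma)
            diff = theta_new - phi
            bound = loss_phi + np.sum(g * diff) + 0.5 * gamma * np.sum(diff ** 2)
            if loss(theta_new) <= bound + 1e-12:
                break            # step-size condition is satisfied
            gamma *= eta         # otherwise increase gamma and retry
        t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        phi = theta_new + ((t - 1.0) / t_new) * (theta_new - theta)
        theta, t = theta_new, t_new
    return theta
```

Keeping the proximal step behind a callable separates the outer loop from the inner solver, so the dual procedure of Sections 3.1.1 and 3.1.2 can be plugged in as `prox` without changing the loop.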



3.1.1 Dual Formulation 

The problem in Eq. (7) is not easy to solve directly; next we show that this problem can be efficiently solved
in its dual form. By reformulating $\|\Theta\|_*$ and $\|\Theta\|_1$ into the equivalent dual forms, we convert Eq. (7) into a
max-min formulation as

$$\max_{L, S} \min_{\Theta} \; \|\Theta - \hat{\Phi}\|_F^2 + \hat{\alpha} \langle L, \Theta \rangle + \hat{\beta} \langle S, \Theta \rangle, \quad \text{subject to } \|L\|_2 \le 1, \; \|S\|_\infty \le 1, \tag{8}$$

where $\hat{\Phi} = \Phi_k - \nabla \mathcal{S}(\Phi_k)/\gamma_k$, $\hat{\alpha} = 2\alpha/\gamma_k$, $\hat{\beta} = 2\beta/\gamma_k$, and $\|S\|_\infty$ denotes the largest absolute entry of $S$.
It can be verified that in Eq. (8) the Slater condition is satisfied and strong duality holds [5]. Also the optimal
$\Theta$ can be expressed as a function of $L$ and $S$ given by

$$\Theta = \hat{\Phi} - \frac{1}{2} \left( \hat{\alpha} L + \hat{\beta} S \right). \tag{9}$$

By substituting Eq. (9) into Eq. (8), we obtain the dual form of Eq. (7) as

$$\min_{L, S} \; \|\hat{\alpha} L + \hat{\beta} S - 2\hat{\Phi}\|_F^2, \quad \text{subject to } \|L\|_2 \le 1, \; \|S\|_\infty \le 1. \tag{10}$$
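
For completeness, the step from Eq. (8) to Eq. (10) can be spelled out as a routine completion of squares. The shorthand $M = \hat{\alpha} L + \hat{\beta} S$ is introduced here only for this illustration; the inner minimum over $\Theta$ is attained at $\Theta = \hat{\Phi} - \tfrac{1}{2} M$, as in Eq. (9).

```latex
\begin{aligned}
\min_{\Theta}\; \|\Theta - \hat{\Phi}\|_F^2 + \langle M, \Theta \rangle
  &= \tfrac{1}{4}\|M\|_F^2 + \langle M, \hat{\Phi} - \tfrac{1}{2} M \rangle \\
  &= \langle M, \hat{\Phi} \rangle - \tfrac{1}{4}\|M\|_F^2 \\
  &= \|\hat{\Phi}\|_F^2 - \tfrac{1}{4}\|M - 2\hat{\Phi}\|_F^2 .
\end{aligned}
```

Since $\|\hat{\Phi}\|_F^2$ does not depend on $L$ or $S$, maximizing this expression over the feasible $(L, S)$ is equivalent to minimizing $\|M - 2\hat{\Phi}\|_F^2$, which is the dual problem in Eq. (10).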



3.1.2 Alternating Optimization 

The optimization problem in Eq. (10) is smooth and convex, and it has two optimization variables. For this type
of problem, the coordinate descent (CD) method is routinely used to compute a globally optimal solution [14].
To solve Eq. (10), the CD method alternately optimizes one of the two variables with the other variable
fixed. Our analysis below shows that the variables $L$ and $S$ in Eq. (10) can each be optimized efficiently. Note
that the convergence rate of the CD method is not known; however, it converges very fast in practice (fewer
than 10 iterations in our experiments).

Optimization of L For a given $S$, the variable $L$ can be optimized via solving the following problem:

$$\min_{L} \; \|L - \hat{L}\|_F^2, \quad \text{subject to } \|L\|_2 \le 1, \tag{11}$$

where $\hat{L} = (2\hat{\Phi} - \hat{\beta} S)/\hat{\alpha}$. The optimization on $L$ above can be interpreted as computing an optimal projection
of a given matrix onto the unit spectral norm ball. Our analysis shows that the optimal solution to Eq. (11)
can be expressed in an analytic form as summarized in the following theorem.

Theorem 3.1. For arbitrary $\hat{L} \in \mathbb{R}^{h \times k}$ in Eq. (11), denote its SVD by $\hat{L} = U \Sigma V^T$, where $r = \mathrm{rank}(\hat{L})$,
$U \in \mathbb{R}^{h \times r}$, $V \in \mathbb{R}^{k \times r}$, and $\Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_r) \in \mathbb{R}^{r \times r}$. Let $\sigma_i^* = \min(\sigma_i, 1)$, $i = 1, \dots, r$. Then the
optimal solution to Eq. (11) is given by

$$L^* = U \hat{\Sigma} V^T, \quad \hat{\Sigma} = \mathrm{diag}(\sigma_1^*, \dots, \sigma_r^*). \tag{12}$$

Proof. Assume the existence of a set of left and right singular vector pairs shared by the optimal $L^*$ to
Eq. (11) and the given $\hat{L}$ for their non-zero singular values. Under such an assumption, it can be verified
that the singular values of $L^*$ can be obtained via

$$\min_{\{\hat{\sigma}_i\}} \; \sum_{i=1}^{r} (\hat{\sigma}_i - \sigma_i)^2, \quad \text{subject to } 0 \le \hat{\sigma}_i \le 1, \; i = 1, \dots, r,$$

to which the optimal solution is given by $\sigma_i^* = \min(\sigma_i, 1)$ ($\forall i$); hence the expression of $L^*$ coincides with
Eq. (12). Therefore, all that remains is to show that our assumption (on the left and right singular vector
pairs of $L^*$ and $\hat{L}$) holds.

Denote the Lagrangian associated with the problem in Eq. (11) as $h(L, \lambda) = \|L - \hat{L}\|_F^2 + \lambda (\|L\|_2 - 1)$,
where $\lambda$ denotes the dual variable. Since $0$ is strictly feasible in Eq. (11), namely, $\|0\|_2 < 1$, strong duality
holds for Eq. (11). Let $\lambda^*$ be the optimal dual variable to Eq. (11). Therefore we have $L^* = \arg\min_L h(L, \lambda^*)$.
It is well known that $L^*$ minimizes $h(L, \lambda^*)$ if and only if $0$ is a subgradient of $h(L, \lambda^*)$ at $L^*$, i.e.,

$$0 \in 2(L^* - \hat{L}) + \lambda^* \partial \|L^*\|_2. \tag{13}$$

For any matrix $Z$, the subdifferential of $\|Z\|_2$ is given by [31] $\partial \|Z\|_2 = \mathrm{conv}\{u_z v_z^T : \|u_z\| = \|v_z\| = 1, \; Z v_z = \|Z\|_2 u_z\}$,
where $\mathrm{conv}\{c\}$ denotes the convex hull of the set $c$. Specifically, any element of $\partial \|Z\|_2$ has the form

$$\sum_i \alpha_i u_{zi} v_{zi}^T, \quad \alpha_i \ge 0, \quad \sum_i \alpha_i = 1,$$

where $u_{zi}$ and $v_{zi}$ are any left and right singular vectors of $Z$ corresponding to its largest singular value (the
top singular values may share a common value). From Eq. (13) and the definition of $\partial \|Z\|_2$, there exist $\{\hat{\alpha}_i\}$
such that $\hat{\alpha}_i \ge 0$, $\sum_i \hat{\alpha}_i = 1$, $\sum_i \hat{\alpha}_i u_{li} v_{li}^T \in \partial \|L^*\|_2$, and

$$\hat{L} = L^* + \frac{\lambda^*}{2} \sum_i \hat{\alpha}_i u_{li} v_{li}^T, \tag{14}$$

where $u_{li}$ and $v_{li}$ are any left and right singular vectors of $L^*$ corresponding to its largest singular
value. Since $\lambda^*, \hat{\alpha}_i \ge 0$, Eq. (14) verifies the existence of a set of left and right singular vector pairs shared
by $L^*$ and $\hat{L}$. This completes the proof. □
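
Theorem 3.1 translates directly into a few lines of NumPy: compute a thin SVD and clip the singular values at 1. The function name below is a hypothetical choice made for this sketch.

```python
import numpy as np

def project_spectral_ball(L_hat):
    """Project L_hat onto {L : ||L||_2 <= 1} by clipping singular values at 1."""
    # Thin SVD: L_hat = U @ diag(s) @ Vt
    U, s, Vt = np.linalg.svd(L_hat, full_matrices=False)
    s_clipped = np.minimum(s, 1.0)       # sigma_i^* = min(sigma_i, 1)
    return (U * s_clipped) @ Vt          # U @ diag(s_clipped) @ Vt
```

The cost is dominated by one SVD of an $h \times k$ matrix, which is the main per-iteration expense of the alternating procedure.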

Optimization of S For a given $L$, the variable $S$ can be optimized via solving the following problem:

$$\min_{S} \; \|S - \hat{S}\|_F^2, \quad \text{subject to } \|S\|_\infty \le 1, \tag{15}$$

where $\hat{S} = (2\hat{\Phi} - \hat{\alpha} L)/\hat{\beta}$. Similarly, the optimization on $S$ can be interpreted as computing a projection of a
given matrix onto an infinity norm ball. It also admits an analytic solution as summarized in the following
lemma.

Lemma 3.1. For any matrix $\hat{S}$, the optimal solution to Eq. (15) is given by

$$S^* = \mathrm{sgn}(\hat{S}) \circ \min(|\hat{S}|, \mathbf{1}), \tag{16}$$

where $\circ$ denotes the component-wise multiplication operator, and $\mathbf{1}$ denotes the matrix of
appropriate size with all entries equal to 1.
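
Putting Theorem 3.1 and Lemma 3.1 together, the alternating procedure for the dual problem in Eq. (10) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the zero initialization, the iteration cap, and the stopping tolerance are choices of this sketch, and both $\hat{\alpha}$ and $\hat{\beta}$ are assumed strictly positive. The primal solution is recovered through Eq. (9).

```python
import numpy as np

def project_spectral_ball(M):
    """Clip singular values at 1 (Theorem 3.1)."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.minimum(s, 1.0)) @ Vt

def prox_sparse_trace(Phi_hat, alpha_hat, beta_hat, n_iter=50, tol=1e-10):
    """Solve min ||T - Phi_hat||_F^2 + alpha_hat*||T||_* + beta_hat*||T||_1
    by alternating over the dual variables L and S of Eq. (10).
    Assumes alpha_hat > 0 and beta_hat > 0."""
    L = np.zeros_like(Phi_hat)
    S = np.zeros_like(Phi_hat)
    prev = np.inf
    for _ in range(n_iter):
        # Update L with S fixed: projection onto the unit spectral norm ball.
        L = project_spectral_ball((2.0 * Phi_hat - beta_hat * S) / alpha_hat)
        # Update S with L fixed: entry-wise clipping to [-1, 1] (Lemma 3.1).
        S = np.clip((2.0 * Phi_hat - alpha_hat * L) / beta_hat, -1.0, 1.0)
        obj = np.sum((alpha_hat * L + beta_hat * S - 2.0 * Phi_hat) ** 2)
        if prev - obj < tol:             # dual objective stopped decreasing
            break
        prev = obj
    Theta = Phi_hat - 0.5 * (alpha_hat * L + beta_hat * S)   # Eq. (9)
    return Theta, L, S
```

In the scalar case with $\hat{\Phi} = 3$ and $\hat{\alpha} = \hat{\beta} = 1$, the problem reduces to minimizing $(\theta - 3)^2 + 2|\theta|$, whose minimizer is $\theta = 2$; the sketch reproduces this value.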

3.2 Alternating Direction Method of Multipliers 

The ADMM algorithm [4] is suitable for non-smooth (convex) optimization problems, as it
blends the decomposability of dual ascent with the superior convergence of the method of multipliers. We
present two implementations of the ADMM algorithm for solving Eq. (6). Due to space constraints, we
defer the detailed discussion of the two ADMM implementations to the supplemental material.

4 Theoretical Analysis 

In this section, we present a performance bound for the function estimation scheme in Eq. (3). Such a
performance bound measures how well the estimation scheme can approximate the regression functions $\{f_j\}$
in Eq. (2) via the sparse low-rank coefficient matrix $\Theta$.

4.1 Basic Properties of the Optimal Solution 

We first present some basic properties of the optimal solution defined in Eq. (6); these properties are
important building blocks of the theoretical analysis that follows.



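Lemma 4.1. Consider the optimization problem in Eq. (6) for $h, k \ge 2$ and $n \ge 1$. Given $n$ sample pairs as
$X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$ and $Y = [y_1, \dots, y_n]^T \in \mathbb{R}^{n \times k}$, let $\mathcal{F}$ and $\mathcal{G}_X$ be defined in Eq. (2) and Eq. (4),
respectively; let $\sigma_{X(1)}$ be the largest singular value of $\mathcal{G}_X$. Assume that $W \in \mathbb{R}^{n \times k}$ has independent and
identically distributed (i.i.d.) entries as $w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$. Take

$$\alpha + \beta = \frac{2 \sigma_{X(1)} \sigma_w \sqrt{n}}{N} \left( 1 + \sqrt{\frac{k}{n}} + t \right), \tag{17}$$

where $N = n \times k$ and $t$ is a universal constant. Then with probability of at least $1 - \exp(-n t^2 / 2)$, for the
minimizer $\hat{\Theta}$ in Eq. (6) and any $\Theta \in \mathbb{R}^{h \times k}$, we have

$$\frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + 2\alpha \|S_0(\hat{\Theta} - \Theta)\|_* + 2\beta \|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1, \tag{18}$$

where $S_0$ is an operator defined in Lemma 1 of the supplemental material.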

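Proof. From the definition of $\hat{\Theta}$ in Eq. (6), we have $\mathcal{S}(\hat{\Theta}) + \alpha \|\hat{\Theta}\|_* + \beta \|\hat{\Theta}\|_1 \le \mathcal{S}(\Theta) + \alpha \|\Theta\|_* + \beta \|\Theta\|_1$. By
substituting $Y = \mathcal{F} + W$ and Eq. (5) into the previous inequality, we have

$$\frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + \frac{2}{N} \langle W, \mathcal{G}_X (\hat{\Theta} - \Theta) \rangle + \alpha \left( \|\Theta\|_* - \|\hat{\Theta}\|_* \right) + \beta \left( \|\Theta\|_1 - \|\hat{\Theta}\|_1 \right).$$

Define the random event

$$\mathcal{A} = \left\{ \frac{1}{N} \|\mathcal{G}_X^T W\|_2 \le \frac{\alpha + \beta}{2} \right\}. \tag{19}$$

Taking $\alpha + \beta$ as the value in Eq. (17), it follows from Lemma 3 of the supplemental material that $\mathcal{A}$ holds
with probability of at least $1 - \exp(-n t^2 / 2)$. Therefore, we have

$$\begin{aligned}
\langle W, \mathcal{G}_X (\hat{\Theta} - \Theta) \rangle &= \frac{\alpha + \beta}{\alpha + \beta} \langle \mathcal{G}_X^T W, \hat{\Theta} - \Theta \rangle \\
&\le \frac{\alpha}{\alpha + \beta} \|\mathcal{G}_X^T W\|_2 \|\hat{\Theta} - \Theta\|_* + \frac{\beta}{\alpha + \beta} \|\mathcal{G}_X^T W\|_\infty \|\hat{\Theta} - \Theta\|_1 \\
&\le \frac{N}{2} \left( \alpha \|\hat{\Theta} - \Theta\|_* + \beta \|\hat{\Theta} - \Theta\|_1 \right),
\end{aligned}$$

where the second inequality follows from $\|\mathcal{G}_X^T W\|_2 \ge \|\mathcal{G}_X^T W\|_\infty$ and the event $\mathcal{A}$. Therefore, under $\mathcal{A}$, we have

$$\frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + \alpha \|\hat{\Theta} - \Theta\|_* + \beta \|\hat{\Theta} - \Theta\|_1 + \alpha \left( \|\Theta\|_* - \|\hat{\Theta}\|_* \right) + \beta \left( \|\Theta\|_1 - \|\hat{\Theta}\|_1 \right).$$

From Corollary 1 and Lemma 2 of the supplemental material, we complete the proof. □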
4.2 Main Assumption 

We introduce a key assumption on the dictionary of basis functions $\mathcal{G}_X$. Based on such an assumption, we
derive a performance bound for the sparse trace norm regularization formulation in Eq. (6).

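Assumption 4.1. For a matrix pair $\Theta$ and $\Delta$ of size $h \times k$, let $s \le \min(h, k)$ and $q \le h \times k$. We assume
that there exist constants $\kappa_1(s)$ and $\kappa_2(q)$ such that

$$\kappa_1(s) = \min_{\Delta \in \mathcal{R}(s, q)} \frac{\|\mathcal{G}_X \Delta\|_F}{\sqrt{N} \, \|S_0(\Delta)\|_*} > 0, \qquad \kappa_2(q) = \min_{\Delta \in \mathcal{R}(s, q)} \frac{\|\mathcal{G}_X \Delta\|_F}{\sqrt{N} \, \|\Delta_{J(\Theta)}\|_1} > 0, \tag{20}$$

where the restricted set $\mathcal{R}(s, q)$ is defined as

$$\mathcal{R}(s, q) = \left\{ \Delta \in \mathbb{R}^{h \times k}, \; \Theta \in \mathbb{R}^{h \times k} \;\middle|\; \Delta \ne 0, \; \mathrm{rank}(S_0(\Delta)) \le s, \; |J(\Theta)| \le q \right\},$$

and $|J(\Theta)|$ denotes the number of nonzero entries in the matrix $\Theta$.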

Our assumption on $\kappa_1(s)$ in Eq. (20) is closely related to but less restrictive than the restricted strong
convexity (RSC) condition used in [18]; its denominator is only a part of the one in RSC and in a different matrix
norm as well. Our assumption on $\kappa_2(q)$ is similar to the restricted eigenvalue (RE) condition used in [3] except
that its denominator is in a different matrix norm; our assumption can also be implied by sufficient conditions
similar to the ones in [3].
4.3 Performance Bound

We derive a performance bound for the sparse trace norm structure obtained by solving Eq. (6). This bound
measures how well the optimal $\hat{\Theta}$ can be used to approximate $\mathcal{F}$ by evaluating the averaged estimation error,
i.e., $\|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 / N$.

Theorem 4.1. Consider the optimization problem in Eq. (6) for $h, k \ge 2$ and $n \ge 1$. Given $n$ sample pairs
as $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times d}$ and $Y = [y_1, \dots, y_n]^T \in \mathbb{R}^{n \times k}$, let $\mathcal{F}$ and $\mathcal{G}_X$ be defined in Eqs. (2) and (4),
respectively; let $\sigma_{X(1)}$ be the largest singular value of $\mathcal{G}_X$. Assume that $W \in \mathbb{R}^{n \times k}$ has i.i.d. entries as
$w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$. Take $\alpha + \beta$ as the value in Eq. (17). Then with probability of at least $1 - \exp(-n t^2 / 2)$, for
the minimizer $\hat{\Theta}$ in Eq. (6), we have

$$\frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le (1 + \epsilon) \inf_{\Theta} \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + \mathcal{E}(\epsilon) \left( \frac{\alpha^2}{\kappa_1^2(2r)} + \frac{\beta^2}{\kappa_2^2(c)} \right), \tag{21}$$

where the infimum is taken over all $\Theta \in \mathbb{R}^{h \times k}$ with $\mathrm{rank}(\Theta) \le r$ and $|J(\Theta)| \le c$, and $\mathcal{E}(\epsilon) > 0$ is a constant depending
only on $\epsilon$.

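Proof. Denote $\Delta = \hat{\Theta} - \Theta$ in Eq. (18). We have

$$\frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + 2\alpha \|S_0(\Delta)\|_* + 2\beta \|\Delta_{J(\Theta)}\|_1. \tag{22}$$

Given $\mathrm{rank}(S_0(\Delta)) \le 2r$ (from Lemma 1 of the supplemental material) and $|J(\Theta)| \le c$, we derive upper bounds
on the components $2\alpha \|S_0(\Delta)\|_*$ and $2\beta \|\Delta_{J(\Theta)}\|_1$ over the restricted set $\mathcal{R}(2r, c)$ based on Assumption 4.1,
respectively. It follows that

$$\begin{aligned}
2\alpha \|S_0(\Delta)\|_* &\le \frac{2\alpha}{\kappa_1(2r) \sqrt{N}} \|\mathcal{G}_X (\hat{\Theta} - \Theta)\|_F \le \frac{2\alpha}{\kappa_1(2r) \sqrt{N}} \left( \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F + \|\mathcal{G}_X \Theta - \mathcal{F}\|_F \right) \\
&\le \frac{\alpha^2 \tau}{\kappa_1^2(2r)} + \frac{1}{N \tau} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 + \frac{\alpha^2 \tau}{\kappa_1^2(2r)} + \frac{1}{N \tau} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2,
\end{aligned} \tag{23}$$

where the last inequality above follows from $2ab \le a^2 \tau + b^2 / \tau$ for $\tau > 0$. Similarly, we have

$$2\beta \|\Delta_{J(\Theta)}\|_1 \le \frac{\beta^2 \tau}{\kappa_2^2(c)} + \frac{1}{N \tau} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 + \frac{\beta^2 \tau}{\kappa_2^2(c)} + \frac{1}{N \tau} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2. \tag{24}$$

Substituting Eqs. (23) and (24) into Eq. (22), we have

$$\left( 1 - \frac{2}{\tau} \right) \frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le \left( 1 + \frac{2}{\tau} \right) \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + 2\tau \left( \frac{\alpha^2}{\kappa_1^2(2r)} + \frac{\beta^2}{\kappa_2^2(c)} \right).$$

Setting $\tau = 2 + 4/\epsilon$ and $\mathcal{E}(\epsilon) = 2(\epsilon + 2)^2 / \epsilon$ in the inequality above, we complete the proof. □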
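
The choice of constants in the last step can be checked by a short calculation. Write $a$ for the averaged error of the minimizer, $b$ for the averaged error of the comparison matrix, and $z$ for the sum $\alpha^2/\kappa_1^2(2r) + \beta^2/\kappa_2^2(c)$; these three letters are shorthand introduced only for this check. Combining the two component bounds with Eq. (22) gives an inequality of the form shown on the first line below, valid for any $\tau > 2$.

```latex
\begin{aligned}
\Big(1 - \tfrac{2}{\tau}\Big)\, a \le \Big(1 + \tfrac{2}{\tau}\Big)\, b + 2\tau z
  \;&\Longrightarrow\;
  a \le \frac{\tau + 2}{\tau - 2}\, b + \frac{2\tau^2}{\tau - 2}\, z, \\
\tau = 2 + \tfrac{4}{\epsilon}: \quad
  \frac{\tau + 2}{\tau - 2} &= \frac{4 + 4/\epsilon}{4/\epsilon} = 1 + \epsilon, \qquad
  \frac{2\tau^2}{\tau - 2} = \frac{2\,(2 + 4/\epsilon)^2}{4/\epsilon} = \frac{2(\epsilon + 2)^2}{\epsilon}.
\end{aligned}
```

The first coefficient is the factor $1 + \epsilon$ in the bound, and the second is the constant that depends only on $\epsilon$.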

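By choosing specific values for $\alpha$ and $\beta$, we can refine the performance bound described in Eq. (21). It
follows from Eq. (17) that

$$\min_{\alpha, \beta \,:\, \alpha + \beta = \gamma} \left( \frac{\alpha^2}{\kappa_1^2(2r)} + \frac{\beta^2}{\kappa_2^2(c)} \right) = \frac{\gamma^2}{\kappa_1^2(2r) + \kappa_2^2(c)}, \qquad \gamma = \frac{2 \sigma_{X(1)} \sigma_w \sqrt{n}}{N} \left( 1 + \sqrt{\frac{k}{n}} + t \right), \tag{25}$$

where the equality of the first equation is achieved by setting $\alpha$ and $\beta$ proportional to $\kappa_1^2(2r)$ and $\kappa_2^2(c)$, i.e.,
$\alpha = \gamma \kappa_1^2(2r) / \left( \kappa_1^2(2r) + \kappa_2^2(c) \right)$ and $\beta = \gamma \kappa_2^2(c) / \left( \kappa_1^2(2r) + \kappa_2^2(c) \right)$. Thus the performance bound in Eq. (21)
can be refined as

$$\frac{1}{N} \|\mathcal{G}_X \hat{\Theta} - \mathcal{F}\|_F^2 \le (1 + \epsilon) \inf_{\Theta} \frac{1}{N} \|\mathcal{G}_X \Theta - \mathcal{F}\|_F^2 + \mathcal{E}(\epsilon) \, \frac{4 \sigma_{X(1)}^2 \sigma_w^2 \, n \left( 1 + \sqrt{k/n} + t \right)^2}{N^2 \left( \kappa_1^2(2r) + \kappa_2^2(c) \right)}.$$

Note that the performance bound above is independent of the values of $\alpha$ and $\beta$, and it is tighter than the
one described in Eq. (21).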
Table 1: Averaged performance (with standard deviation) comparison in terms of AUC, Macro F1, and Micro F1.
Note that n, d, and m denote the sample size, dimensionality, and label number, respectively.

Data Set                 Business            Arts                Health              Scene
(n, d, m)                (9968, 16621, 17)   (7441, 17973, 19)   (9109, 18430, 14)   (2407, 294, 6)

AUC        S.TraceNorm   85.42 ± 0.31        76.31 ± 0.15        86.18 ± 0.56        91.54 ± 0.18
           TraceNorm     83.43 ± 0.41        75.90 ± 0.27        85.24 ± 0.42        90.33 ± 0.24
           OneNorm       81.95 ± 0.26        70.47 ± 0.18        83.60 ± 0.32        88.42 ± 0.31

Macro F1   S.TraceNorm   48.83 ± 0.13        32.83 ± 0.25        60.05 ± 0.36        51.65 ± 0.33
           TraceNorm     47.24 ± 0.15        31.90 ± 0.31        58.91 ± 0.24        50.59 ± 0.08
           OneNorm       46.28 ± 0.25        31.03 ± 0.46        58.01 ± 0.18        46.57 ± 1.10

Micro F1   S.TraceNorm   78.26 ± 0.71        42.91 ± 0.27        67.22 ± 0.47        52.83 ± 0.35
           TraceNorm     78.84 ± 0.11        42.08 ± 0.11        66.92 ± 0.42        52.06 ± 0.49
           OneNorm       78.16 ± 0.17        40.64 ± 0.52        66.37 ± 0.19        47.32 ± 0.13


5 Experiments 

In this section, we evaluate the effectiveness of the sparse trace norm regularization formulation in Eq. (6)
on benchmark data sets; we also conduct numerical studies on the convergence of AG and two ADMM
implementations, ADMM1 and ADMM2 (see details in Section E of the supplemental material), for
solving Eq. (6), and on the convergence of the alternating optimization algorithm for solving Eq. (10). Note that
we use the least squares loss in the following experiments.

Performance Evaluation We apply the sparse trace norm regularization formulation (S.TraceNorm) to
multi-label classification problems, in comparison with the trace norm regularization formulation
(TraceNorm) and the $\ell_1$-norm regularization formulation (OneNorm). AUC, Macro F1, and Micro F1 are used
as the classification performance measures. Four benchmark data sets, including Business, Arts, and Health
from the Yahoo webpage data sets [30] and Scene from the LIBSVM multi-label data sets¹, are employed in this
experiment. The reported experimental results are averaged over 10 random partitions of the data sets into
training and test sets with the ratio 1 : 9. We use the AG method to solve the S.TraceNorm formulation, and
stop the iterative procedure of AG if the change of the objective values in two successive iterations is smaller
than $10^{-8}$ or the number of iterations exceeds $10^5$. The regularization parameters $\alpha$ and $\beta$ are determined
via double cross-validation from the set {10 -2 x U {10 _1 x i}J£ 2 U{2x «}i=i- 

We present the averaged performance of the competing algorithms in Table 1. The main observations are
summarized as follows: (1) S.TraceNorm achieves the best performance on all benchmark data sets (except
for Micro F1 on the Business data) in this experiment; this result demonstrates the effectiveness of the induced
sparse low-rank structure for multi-label classification tasks; (2) TraceNorm outperforms OneNorm on all
benchmark data sets; this result demonstrates the effectiveness of modeling a shared low-rank structure for
high-dimensional text and image data analysis.

Numerical Study We study the practical convergence of AG and ADMM2 by solving Eq. (6) on the Scene
data. In our experiments, we observe that ADMM1 is much slower than ADMM2 and we thus only focus
on ADMM2. Note that in AG, we set $\alpha = 1$, $\beta = 1$; in ADMM2, we set $\alpha = 1$, $\beta = 1$, $\rho_1 = \rho_2 = 10$. For
other parameter settings, we observe similar trends.

In the first experiment, we compare AG and ADMM2 in terms of practical convergence. We stop
ADMM2 when the change of the objective values in two successive iterations is smaller than $10^{-4}$; the attained
objective value in ADMM2 is used as the stopping criterion for AG, that is, we stop AG if the attained
objective value in AG is equal to or smaller than the objective value attained in ADMM2. The convergence
curves of ADMM2 and AG are presented in the left plot of Figure 1. We observe that AG
converges much faster than ADMM2. In the second experiment, we study the convergence of AG. We
stop AG when the change of the objective values in two successive iterations is smaller than $10^{-8}$. The
convergence curve is presented in the middle plot of Figure 1. We observe that AG converges very fast, and
its convergence speed is consistent with the theoretical convergence analysis in [19].

We also conduct a numerical study on the alternating optimization algorithm (in Section 3.1.2) for solving
the dual formulation of the proximal operator in Eq. (10). Similarly, the alternating optimization algorithm
is stopped when the change of the objective values in two successive iterations is smaller than $10^{-8}$. For
illustration, in Eq. (10) we randomly generate the matrix $\hat{\Phi}$ of size 10000 by 5000 from $\mathcal{N}(0, 1)$; we then
apply the alternating optimization algorithm to solve Eq. (10) and plot its convergence curve in the right plot
of Figure 1. Our experimental results show that the alternating optimization algorithm generally converges
within 10 iterations, which demonstrates the practical efficiency of this algorithm.

¹ http://www.csie.ntu.edu.tw/~cjlin/

Figure 1: Convergence comparison of AG and ADMM2 for solving Eq. (6) (left plot); convergence plot of AG for
solving Eq. (6) (middle plot); and convergence plot of the alternating optimization algorithm for solving the dual
formulation of the proximal operator in Eq. (10) (right plot).



6 Conclusion 



We study the problem of estimating multiple predictive functions simultaneously in the nonparametric
regression setting. In our estimation scheme, each predictive function is estimated using a linear combination
of a dictionary of pre-specified basis functions. By assuming that the coefficient matrix admits a sparse
low-rank structure, we formulate the function estimation problem as a convex program with the trace norm
and the $\ell_1$-norm regularization. We propose to employ AG and ADMM algorithms to solve the function
estimation problem and also develop efficient algorithms for the key components involved in AG and ADMM.
We derive a key property of the optimal solution to the convex program; moreover, based on an assumption
associated with the basis functions, we establish a performance bound of the proposed function estimation
scheme using the composite regularization. Our simulation studies demonstrate the effectiveness and the
efficiency of the proposed formulation. In the future, we plan to derive a formal sparse oracle inequality for
the convex problem in Eq. (6) as in [3]; we also plan to apply the proposed function estimation formulation
to other real-world applications.
Sparse Trace Norm Regularization: Supplemental Material

A. Operators $S_0$ and $S_1$

We define two operators, namely $S_0$ and $S_1$, on an arbitrary matrix pair (of the same size) based on Lemma 3.4
in [22], as summarized in the following lemma.

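Lemma 1. Given any $\Theta$ and $\Delta$ of size $h \times k$, let $\mathrm{rank}(\Theta) = r$ and denote the SVD of $\Theta$ as

$$\Theta = U \begin{bmatrix} \Sigma & 0 \\ 0 & 0 \end{bmatrix} V^T,$$

where $U \in \mathbb{R}^{h \times h}$ and $V \in \mathbb{R}^{k \times k}$ are orthogonal, and $\Sigma \in \mathbb{R}^{r \times r}$ is diagonal, consisting of the non-zero singular values
on its main diagonal. Let

$$\hat{\Delta} = U^T \Delta V = \begin{bmatrix} \hat{\Delta}_{11} & \hat{\Delta}_{12} \\ \hat{\Delta}_{21} & \hat{\Delta}_{22} \end{bmatrix},$$

where $\hat{\Delta}_{11} \in \mathbb{R}^{r \times r}$, $\hat{\Delta}_{12} \in \mathbb{R}^{r \times (k - r)}$, $\hat{\Delta}_{21} \in \mathbb{R}^{(h - r) \times r}$, and $\hat{\Delta}_{22} \in \mathbb{R}^{(h - r) \times (k - r)}$. Define $S_0$ and $S_1$ as

$$S_0(\Theta, \Delta) = U \begin{bmatrix} \hat{\Delta}_{11} & \hat{\Delta}_{12} \\ \hat{\Delta}_{21} & 0 \end{bmatrix} V^T, \qquad S_1(\Theta, \Delta) = U \begin{bmatrix} 0 & 0 \\ 0 & \hat{\Delta}_{22} \end{bmatrix} V^T.$$

Then the following conditions hold: $\mathrm{rank}(S_0(\Theta, \Delta)) \le 2r$, $\Theta \, S_1(\Theta, \Delta)^T = 0$, $\Theta^T S_1(\Theta, \Delta) = 0$.

The result presented in Lemma 1 implies a condition under which the trace norm on a matrix pair is additive.
From Lemma 1 we can easily verify that

$$\|\Theta + S_1(\Theta, \Delta)\|_* = \|\Theta\|_* + \|S_1(\Theta, \Delta)\|_* \tag{26}$$

for arbitrary $\Theta$ and $\Delta$ of the same size. To avoid cluttered notation, we denote $S_0(\Theta, \Delta)$ by $S_0(\Delta)$, and $S_1(\Theta, \Delta)$ by
$S_1(\Delta)$ throughout this paper, as the appropriate $\Theta$ can be easily determined from the context.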

B. Bound on Trace Norm 

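As a consequence of Lemma 1, we derive a bound on the trace norm of the matrices of interest as summarized below.

Corollary 1. Given an arbitrary matrix pair $\hat{\Theta}$ and $\Theta$, let $\Delta = \hat{\Theta} - \Theta$. Then

$$\|\hat{\Theta} - \Theta\|_* + \|\Theta\|_* - \|\hat{\Theta}\|_* \le 2 \|S_0(\Delta)\|_*.$$

Proof. From Lemma 1 we have $\Delta = S_0(\Delta) + S_1(\Delta)$ for the matrix pair $\Theta$ and $\Delta$. Moreover,

$$\begin{aligned}
\|\hat{\Theta}\|_* = \|\Theta + S_0(\Delta) + S_1(\Delta)\|_* &\ge \|\Theta + S_1(\Delta)\|_* - \|S_0(\Delta)\|_* \\
&= \|\Theta\|_* + \|S_1(\Delta)\|_* - \|S_0(\Delta)\|_*,
\end{aligned} \tag{27}$$

where the inequality above follows from the triangle inequality and the last equality above follows from Eq. (26).
Using the result in Eq. (27), we have

$$\begin{aligned}
\|\hat{\Theta} - \Theta\|_* + \|\Theta\|_* - \|\hat{\Theta}\|_* &\le \|\Delta\|_* + \|\Theta\|_* - \|\Theta\|_* - \|S_1(\Delta)\|_* + \|S_0(\Delta)\|_* \\
&\le 2 \|S_0(\Delta)\|_*.
\end{aligned}$$

We complete the proof of this corollary. □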



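C. Bound on $\ell_1$-norm

Analogous to the bound on the trace norm in Corollary 1, we also derive a bound on the $\ell_1$-norm of the matrices of
interest in the following lemma. For arbitrary matrices $\Theta$ and $\Delta$, we denote by $J(\Theta) = \{(i, j)\}$ the coordinate set
(the location set of nonzero entries) of $\Theta$, and by $J(\Theta)^\perp$ the associated complement (the location set of zero entries);
we denote by $\Delta_{J(\Theta)}$ the matrix with the same entries as $\Delta$ on the set $J(\Theta)$ and with zero entries on the set $J(\Theta)^\perp$. We
now present a result associated with $J(\Theta)$ and $J(\Theta)^\perp$ in the following lemma. Note that a similar result for the
vector case is presented in [3].

Lemma 2. Given a matrix pair $\hat{\Theta}$ and $\Theta$ of the same size, the inequality below always holds:

$$\|\hat{\Theta} - \Theta\|_1 + \|\Theta\|_1 - \|\hat{\Theta}\|_1 \le 2 \|\hat{\Theta}_{J(\Theta)} - \Theta_{J(\Theta)}\|_1. \tag{28}$$

Proof. It can be verified that the inequality

$$\|\Theta_{J(\Theta)}\|_1 - \|\hat{\Theta}_{J(\Theta)}\|_1 \le \|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1$$

and the equalities

$$\Theta_{J(\Theta)^\perp} = 0, \qquad \|(\hat{\Theta} - \Theta)_{J(\Theta)^\perp}\|_1 - \|\hat{\Theta}_{J(\Theta)^\perp}\|_1 = 0$$

hold. Therefore we can derive

$$\begin{aligned}
&\|\hat{\Theta} - \Theta\|_1 + \|\Theta\|_1 - \|\hat{\Theta}\|_1 \\
&= \|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1 + \|(\hat{\Theta} - \Theta)_{J(\Theta)^\perp}\|_1 + \|\Theta_{J(\Theta)}\|_1 + \|\Theta_{J(\Theta)^\perp}\|_1 - \|\hat{\Theta}_{J(\Theta)}\|_1 - \|\hat{\Theta}_{J(\Theta)^\perp}\|_1 \\
&\le 2 \|(\hat{\Theta} - \Theta)_{J(\Theta)}\|_1.
\end{aligned}$$

This completes the proof of this lemma. □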

D. Concentration Inequality 

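Lemma 3. Let $\sigma_{X(1)}$ be the maximum singular value of the matrix $\mathcal{G}_X \in \mathbb{R}^{n \times h}$; let $W \in \mathbb{R}^{n \times k}$ be the matrix of i.i.d.
entries as $w_{ij} \sim \mathcal{N}(0, \sigma_w^2)$. Let $\lambda = 2 \sigma_{X(1)} \sigma_w \sqrt{n} \left( 1 + \sqrt{k/n} + t \right) / N$. Then

$$\Pr\left( \|W^T \mathcal{G}_X\|_2 / N \le \lambda / 2 \right) \ge 1 - \exp(-n t^2 / 2).$$

Proof. It is known [26] that a Gaussian matrix $\hat{W} \in \mathbb{R}^{n \times k}$ with $n \ge k$ and $\hat{w}_{ij} \sim \mathcal{N}(0, 1/n)$ satisfies

$$\Pr\left( \|\hat{W}\|_2 > 1 + \sqrt{k/n} + t \right) \le \exp(-n t^2 / 2), \tag{29}$$

where $t$ is a universal constant. From the definition of the largest singular value, there exists a vector $b \in \mathbb{R}^h$ of length
1, i.e., $\|b\|_2 = 1$, such that $\|W^T \mathcal{G}_X\|_2 = \|W^T \mathcal{G}_X b\|_2 \le \|W\|_2 \|\mathcal{G}_X b\|_2 \le \sigma_{X(1)} \|W\|_2$. Since $w_{ij} / (\sigma_w \sqrt{n}) \sim \mathcal{N}(0, 1/n)$,
we have

$$\Pr\left( \|W^T \mathcal{G}_X\|_2 / N > \lambda / 2 \right) \le \Pr\left( \sigma_{X(1)} \|W\|_2 / N > \lambda / 2 \right).$$

Applying the result in Eq. (29) to the inequality above, we complete the proof of this lemma. □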



E. Implementations of the Alternating Direction Method of Multipliers for Solving Eq. (6)

We employ two variants of the Alternating Direction Method of Multipliers (ADMM) to solve Eq. (6). The key
difference lies in the use of different numbers of auxiliary variables to separate the smooth components from the
non-smooth components of the objective function in Eq. (6).
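E.1 The First Implementation: ADMM1

By adding an auxiliary variable $\Psi$, we reformulate Eq. (6) as

$$\min_{\Theta, \Psi} \; \mathcal{S}(\Theta) + \alpha \|\Psi\|_* + \beta \|\Theta\|_1, \quad \text{subject to } \Theta = \Psi. \tag{30}$$

The augmented Lagrangian of Eq. (30) can be expressed as

$$\mathcal{L}_\rho(\Theta, \Psi, \Gamma) = \mathcal{S}(\Theta) + \alpha \|\Psi\|_* + \beta \|\Theta\|_1 + \langle \Theta - \Psi, \Gamma \rangle + \frac{\rho}{2} \|\Theta - \Psi\|_F^2. \tag{31}$$

To solve Eq. (30), ADMM1 consists of the following iterations:

$$\Theta_{k+1} = \arg\min_{\Theta} \mathcal{L}_\rho(\Theta, \Psi_k, \Gamma_k), \tag{32}$$

$$\Psi_{k+1} = \arg\min_{\Psi} \mathcal{L}_\rho(\Theta_{k+1}, \Psi, \Gamma_k), \tag{33}$$

$$\Gamma_{k+1} = \Gamma_k + \rho \left( \Theta_{k+1} - \Psi_{k+1} \right), \tag{34}$$

where $\Theta_k$, $\Psi_k$, and $\Gamma_k$ denote the intermediate solutions of ADMM1 at the $k$-th iteration, and $\rho$ is a pre-specified
constant.

Specifically, if we employ the least squares loss, i.e., $\mathcal{S}(\Theta) = \|\mathcal{G}_X \Theta - Y\|_F^2 / N$, the optimization problems in
Eqs. (32) and (33) can be efficiently solved as below.

Update on $\Theta$ The optimal $\Theta_{k+1}$ to Eq. (32) can be obtained via

$$\Theta_{k+1} = \arg\min_{\Theta} \left( \frac{1}{N} \|\mathcal{G}_X \Theta - Y\|_F^2 + \beta \|\Theta\|_1 + \langle \Theta, \Gamma_k \rangle + \frac{\rho}{2} \|\Theta - \Psi_k\|_F^2 \right), \tag{35}$$

which can be efficiently solved via gradient-type methods [2, 20].

Update on $\Psi$ The optimal $\Psi_{k+1}$ to Eq. (33) can be obtained via

$$\Psi_{k+1} = \arg\min_{\Psi} \left( \alpha \|\Psi\|_* - \langle \Psi, \Gamma_k \rangle + \frac{\rho}{2} \|\Theta_{k+1} - \Psi\|_F^2 \right).$$

The optimization problem above admits an analytical solution [22]. Assume $\mathrm{rank}(\Theta_{k+1} + \Gamma_k / \rho) = r$. Let
$\Theta_{k+1} + \Gamma_k / \rho = U_r \Sigma_r V_r^T$ be the singular value decomposition of $\Theta_{k+1} + \Gamma_k / \rho$, where $U_r$ and $V_r$ each consist of $r$
orthonormal columns, and $\Sigma_r = \mathrm{diag}\{(\sigma_1, \sigma_2, \dots, \sigma_r)\}$. Then the optimal $\Psi_{k+1}$ is given by

$$\Psi_{k+1} = U_r \hat{\Sigma} V_r^T, \quad \hat{\Sigma} = \mathrm{diag}\left\{ \left( \sigma_i - \frac{\alpha}{\rho} \right)_+ \right\}, \tag{36}$$

where $(x)_+ = x$ if $x > 0$ and $(x)_+ = 0$ otherwise.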
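
The closed-form update in Eq. (36) is singular value soft-thresholding. A minimal NumPy sketch follows; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def singular_value_threshold(M, tau):
    """Shrink each singular value of M by tau and truncate at zero."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    s_shrunk = np.maximum(s - tau, 0.0)      # (sigma_i - tau)_+
    return (U * s_shrunk) @ Vt

def admm1_update_psi(Theta_next, Gamma, alpha, rho):
    """Psi update of ADMM1: threshold Theta_{k+1} + Gamma_k/rho at alpha/rho."""
    return singular_value_threshold(Theta_next + Gamma / rho, alpha / rho)
```

Singular values at or below $\alpha / \rho$ are set to zero, so a larger ratio $\alpha / \rho$ yields a lower-rank $\Psi_{k+1}$.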

E.2 The Second Implementation: ADMM2 

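By adding two auxiliary variables $\Psi^1$ and $\Psi^2$, we reformulate Eq. (6) as

$$\min_{\Theta, \Psi^1, \Psi^2} \; \mathcal{S}(\Theta) + \alpha \|\Psi^1\|_* + \beta \|\Psi^2\|_1, \quad \text{subject to } \Theta = \Psi^1, \; \Theta = \Psi^2. \tag{37}$$

Similarly, the augmented Lagrangian of Eq. (37) can be expressed as

$$\begin{aligned}
&\mathcal{L}_{\rho_1, \rho_2}(\Theta, \Psi^1, \Psi^2, \Gamma^1, \Gamma^2) \\
&= \mathcal{S}(\Theta) + \alpha \|\Psi^1\|_* + \beta \|\Psi^2\|_1 + \langle \Theta - \Psi^1, \Gamma^1 \rangle + \langle \Theta - \Psi^2, \Gamma^2 \rangle + \frac{\rho_1}{2} \|\Theta - \Psi^1\|_F^2 + \frac{\rho_2}{2} \|\Theta - \Psi^2\|_F^2.
\end{aligned}$$

To solve Eq. (37), ADMM2 consists of the following iterations:

$$\Theta_{k+1} = \arg\min_{\Theta} \mathcal{L}_{\rho_1, \rho_2}(\Theta, \Psi^1_k, \Psi^2_k, \Gamma^1_k, \Gamma^2_k), \tag{38}$$

$$\{\Psi^1_{k+1}, \Psi^2_{k+1}\} = \arg\min_{\Psi^1, \Psi^2} \mathcal{L}_{\rho_1, \rho_2}(\Theta_{k+1}, \Psi^1, \Psi^2, \Gamma^1_k, \Gamma^2_k), \tag{39}$$

$$\Gamma^1_{k+1} = \Gamma^1_k + \rho_1 \left( \Theta_{k+1} - \Psi^1_{k+1} \right), \tag{40}$$

$$\Gamma^2_{k+1} = \Gamma^2_k + \rho_2 \left( \Theta_{k+1} - \Psi^2_{k+1} \right), \tag{41}$$

where $\Theta_k$, $\Psi^1_k$, $\Psi^2_k$, $\Gamma^1_k$, and $\Gamma^2_k$ denote the intermediate solutions at the $k$-th iteration of the ADMM2 method.

Specifically, if we employ $\mathcal{S}(\Theta) = \|\mathcal{G}_X \Theta - Y\|_F^2 / N$ as the loss function in Eq. (37), the optimization problems in
Eqs. (38), (39), (40), and (41) can be efficiently solved as below.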



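Update on $\Theta$ The optimal $\Theta_{k+1}$ to Eq. (38) can be obtained via

$$\Theta_{k+1} = \arg\min_{\Theta} \left( \frac{1}{N} \|\mathcal{G}_X \Theta - Y\|_F^2 + \langle \Theta, \Gamma^1_k + \Gamma^2_k \rangle + \frac{\rho_1}{2} \|\Theta - \Psi^1_k\|_F^2 + \frac{\rho_2}{2} \|\Theta - \Psi^2_k\|_F^2 \right).$$

Note that the optimal $\Theta_{k+1}$ can be obtained by solving a system of linear equations.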
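
Setting the gradient of the quadratic objective above to zero gives the linear system $\left( \tfrac{2}{N} \mathcal{G}_X^T \mathcal{G}_X + (\rho_1 + \rho_2) I \right) \Theta = \tfrac{2}{N} \mathcal{G}_X^T Y - \Gamma^1_k - \Gamma^2_k + \rho_1 \Psi^1_k + \rho_2 \Psi^2_k$. The sketch below solves it with a dense solver; the function name and the choice of `np.linalg.solve` are assumptions of this illustration, not a description of the authors' code.

```python
import numpy as np

def admm2_update_theta(G, Y, Psi1, Psi2, Gamma1, Gamma2, rho1, rho2):
    """Theta update of ADMM2 for the least squares loss (illustrative sketch)."""
    n, k = Y.shape
    N = n * k
    h = G.shape[1]
    # Left-hand side: (2/N) G^T G + (rho1 + rho2) I, positive definite for rho1 + rho2 > 0.
    A = (2.0 / N) * (G.T @ G) + (rho1 + rho2) * np.eye(h)
    # Right-hand side collects the data term, multipliers, and auxiliary variables.
    B = (2.0 / N) * (G.T @ Y) - Gamma1 - Gamma2 + rho1 * Psi1 + rho2 * Psi2
    return np.linalg.solve(A, B)
```

Because the left-hand side matrix does not change across iterations when $\rho_1$ and $\rho_2$ are fixed, a factorization could be computed once and reused; the sketch omits this for brevity.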
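Update on $\Psi^1$ and $\Psi^2$ The optimal $\Psi^1_{k+1}$ and $\Psi^2_{k+1}$ to Eq. (39) can be obtained via

$$\Psi^1_{k+1} = \arg\min_{\Psi^1} \left( \alpha \|\Psi^1\|_* - \langle \Psi^1, \Gamma^1_k \rangle + \frac{\rho_1}{2} \|\Theta_{k+1} - \Psi^1\|_F^2 \right), \tag{42}$$

$$\Psi^2_{k+1} = \arg\min_{\Psi^2} \left( \beta \|\Psi^2\|_1 - \langle \Psi^2, \Gamma^2_k \rangle + \frac{\rho_2}{2} \|\Theta_{k+1} - \Psi^2\|_F^2 \right). \tag{43}$$

It can be verified that Eq. (42) admits an analytical solution. Assume $\mathrm{rank}(\Theta_{k+1} + \Gamma^1_k / \rho_1) = r$. Let $\Theta_{k+1} + \Gamma^1_k / \rho_1 =
U_r \Sigma_r V_r^T$ be the singular value decomposition of $\Theta_{k+1} + \Gamma^1_k / \rho_1$, where $U_r$ and $V_r$ each consist of $r$ orthonormal
columns, and $\Sigma_r = \mathrm{diag}\{(\sigma_1, \sigma_2, \dots, \sigma_r)\}$. Then the optimal $\Psi^1_{k+1}$ is given by

$$\Psi^1_{k+1} = U_r \hat{\Sigma} V_r^T, \quad \hat{\Sigma} = \mathrm{diag}\left\{ \left( \sigma_i - \frac{\alpha}{\rho_1} \right)_+ \right\}, \tag{44}$$

where $(x)_+ = x$ if $x > 0$ and $(x)_+ = 0$ otherwise.

Moreover, it can also be verified that Eq. (43) admits an analytical solution. Let $\psi$, $\theta$, and $\gamma$ be the entries of
$\Psi^2_{k+1}$, $\Theta_{k+1}$, and $\Gamma^2_k$ at the same coordinates. The optimal $\psi$ is given by

$$\psi = \begin{cases}
\theta + \dfrac{\gamma}{\rho_2} - \dfrac{\beta}{\rho_2}, & \theta + \dfrac{\gamma}{\rho_2} > \dfrac{\beta}{\rho_2}, \\[2mm]
0, & -\dfrac{\beta}{\rho_2} \le \theta + \dfrac{\gamma}{\rho_2} \le \dfrac{\beta}{\rho_2}, \\[2mm]
\theta + \dfrac{\gamma}{\rho_2} + \dfrac{\beta}{\rho_2}, & \theta + \dfrac{\gamma}{\rho_2} < -\dfrac{\beta}{\rho_2}.
\end{cases} \tag{45}$$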
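
The entry-wise rule in Eq. (45) is the standard soft-thresholding operator applied to $\Theta_{k+1} + \Gamma^2_k / \rho_2$ with threshold $\beta / \rho_2$. A minimal NumPy sketch, with hypothetical function names chosen for this illustration:

```python
import numpy as np

def soft_threshold(Z, tau):
    """Entry-wise soft-thresholding: shrink magnitudes by tau, truncate at zero."""
    return np.sign(Z) * np.maximum(np.abs(Z) - tau, 0.0)

def admm2_update_psi2(Theta_next, Gamma2, beta, rho2):
    """Psi^2 update of ADMM2: threshold Theta_{k+1} + Gamma^2_k/rho2 at beta/rho2."""
    return soft_threshold(Theta_next + Gamma2 / rho2, beta / rho2)
```

Entries whose magnitude does not exceed $\beta / \rho_2$ become exactly zero, which is how the update produces sparsity in $\Psi^2_{k+1}$.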